Paraphrase Pattern Acquisition by Diversifiable Bootstrapping

Author

  • Hideki Shima
Abstract

Texts that convey the same or a similar meaning can be written in many different ways. Because of this, computer programs are not good at recognizing meaning similarity between short texts. To address this problem, researchers have been investigating methods for automatically acquiring paraphrase templates from a corpus (paraphrase extraction). State-of-the-art approaches to paraphrase extraction have limited ability to detect variation (e.g. “X died of Y”, “X has died of Y”, “X was dying of Y”, “X died from Y”, “X was killed in Y”). For practical use, for instance in Information Extraction, a paraphrase resource should ideally have higher coverage so that it can recognize more ways of conveying the same meaning in text (e.g. “X succumbed to Y”, “X fell victim to Y”, “X suffered a fatal Y”, “X was terminally ill with Y”, “X lost his long battle with Y”, “X (writer) wrote his final chapter Y”), without adding noisy patterns or instances that convey a meaning different from the original seed meaning (semantic drift).

The goal of this thesis work is to develop a paraphrase extraction algorithm that can acquire lexically diverse binary-relation paraphrase templates, given a relatively small number of seed instances for a certain relation and an unstructured monolingual corpus. The proposed algorithm runs iteratively: the seed instances are used to extract paraphrase patterns, these patterns are then used to extract more seed instances for the next iteration, and so on. The proposed work is unique in the sense that the lexical diversity of the resulting paraphrase patterns can be controlled with a parameter, and that semantic drift is deferred by identifying erroneous instances using a distributional type model. We also propose a new metric, DIMPLE, which measures the quality of paraphrases while taking lexical diversity into consideration.
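The iterative loop described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the token-overlap (Jaccard) redundancy penalty, the greedy selection, and the toy corpus are all assumptions made for illustration, and the distributional type model used to catch erroneous instances is omitted entirely. The only ideas taken from the abstract are the alternation between pattern extraction and instance extraction and the existence of a single parameter (here the hypothetical `diversity`) that controls lexical diversity.

```python
import re
from collections import defaultdict


def extract_patterns(corpus, instances):
    """Map each template 'X <words> Y' to the instances that support it."""
    support = defaultdict(set)
    for sentence in corpus:
        for x, y in instances:
            m = re.search(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y), sentence)
            if m:
                support["X " + m.group(1) + " Y"].add((x, y))
    return support


def extract_instances(corpus, patterns):
    """Apply templates to the corpus and collect single-word (X, Y) pairs."""
    found = set()
    for pattern in patterns:
        middle = re.escape(pattern[2:-2])  # drop the leading "X " and trailing " Y"
        regex = re.compile(r"(\w+)\s+" + middle + r"\s+(\w+)")
        for sentence in corpus:
            found.update(regex.findall(sentence))
    return found


def jaccard(a, b):
    """Token-level Jaccard similarity between two templates."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def select_patterns(support, accepted, k, diversity):
    """Greedily pick up to k new templates.

    Assumed scoring (not from the source): normalized seed support,
    minus a penalty for token overlap with templates already chosen.
    """
    chosen = list(accepted)
    candidates = sorted(p for p in support if p not in accepted)
    top = max((len(s) for s in support.values()), default=1)
    picked = []
    while candidates and len(picked) < k:
        def score(p):
            redundancy = max((jaccard(p, q) for q in chosen), default=0.0)
            return (1 - diversity) * len(support[p]) / top - diversity * redundancy
        best = max(candidates, key=score)
        candidates.remove(best)
        chosen.append(best)
        picked.append(best)
    return picked


def bootstrap(corpus, seeds, iterations=2, k=2, diversity=0.5):
    """Alternate between pattern extraction and instance extraction."""
    instances, patterns = set(seeds), []
    for _ in range(iterations):
        support = extract_patterns(corpus, instances)
        patterns += select_patterns(support, patterns, k, diversity)
        instances |= extract_instances(corpus, patterns)
    return patterns, instances


# Toy corpus and seed, invented for this sketch.
corpus = [
    "Mozart died of fever",
    "Mozart succumbed to fever",
    "Chopin died of tuberculosis",
    "Chopin fell victim to tuberculosis",
]
patterns, instances = bootstrap(corpus, {("Mozart", "fever")})
print(patterns)
print(sorted(instances))
```

In this toy setting, the second iteration can reach “X fell victim to Y” only because the first iteration added a new instance, which is the bootstrapping effect the abstract describes. Raising `diversity` makes the selector prefer templates that share fewer tokens with those already accepted; without a check such as the distributional type model, a loop like this has nothing to stop semantic drift.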
Our hypothesis is that a model that explicitly controls diversity and includes a distributional type constraint will outperform the state of the art as measured by precision, recall, and DIMPLE. We also present experimental results to support this hypothesis.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my academic advisor, Dr. Teruko Mitamura. She has been extremely supportive throughout my years as a graduate student. My academic achievements, including this dissertation work, would not have been possible without her assistance. I cannot thank her enough for everything she has done for me.

I would like to extend my deep appreciation to Dr. Eric Nyberg. I learned a lot from him through research experience in multiple projects he led, especially the Javelin QA project, and through four courses I assisted with as a TA. Through years of working closely together, his philosophy of science and engineering has influenced me greatly.

I am truly grateful for the support of my committee members, Dr. Eduard Hovy and Dr. Patrick Pantel. I am greatly honored to have received eye-opening advice and research questions from the most recognized and enthusiastic experts in the field I can imagine.

My appreciation also goes to the many people who helped me develop my academic career. Dr. Kemal Oflazer at CMU in Qatar gave me an opportunity to expand my knowledge of technology-driven education. Dr. Noriko Kando at NII in Japan mentored me in organizing a few NTCIR tasks. The DeepQA/Watson team at IBM Research, especially Dr. James Fan, Dr. J. William Murdock, Dr. Adam Lally, Dr. Chris Welty, Dr. Hiroshi Kanayama, Dr. Takeda Koichi and Dr. Eric W. Brown, gave me an exciting opportunity to be part of the Jeopardy! grand challenge as well as the Machine Reading project.

I thank the faculty members, staff members and colleagues at CMU, especially Dr. Jaime Carbonell for leading and continuing to improve the world’s top-level education and research organization, the Language Technologies Institute. My thanks also go to Ms. Susan Holm for helping us create part of the data used in this thesis. Finally, I would like to thank my family and friends for their patience, love and support over many years.

Similar resources

Diversifiable Bootstrapping for Acquiring High-Coverage Paraphrase Resource

Recognizing similar or close meaning across different surface forms is a common challenge in various Natural Language Processing and Information Access applications. However, we identified multiple limitations in existing resources that can be used for solving the vocabulary mismatch problem. To this end, we will propose the Diversifiable Bootstrapping algorithm that can learn paraphrase patterns wi...

The Development of Human 1

This chapter discusses a uniquely human learning mechanism, bootstrapping, whereby external symbolic systems (especially language) enable the creation of new, internal representational resources. The process is described in general terms and is also illustrated with a specific case: the acquisition of concepts for positive integers (1, 2, 3, etc.). First, the...

MIPA: Mutual Information Based Paraphrase Acquisition via Bilingual Pivoting

We present a pointwise mutual information (PMI) based approach for formalizing paraphrasability and propose a variant of PMI, called mutual information based paraphrase acquisition (MIPA), for paraphrase acquisition. Our paraphrase acquisition method first acquires lexical paraphrase pairs by bilingual pivoting and then reranks them by PMI and distributional similarity. The complementary nature...

Interrogative Reformulation Patterns and Acquisition of Question Paraphrases

We describe a set of paraphrase patterns for questions which we derived from a corpus of questions, and report the result of using them in the automatic recognition of question paraphrases. The aim of our paraphrase patterns is to factor out different syntactic variations of interrogative words, since the interrogative part of a question adds a syntactic superstructure on the sentence part (i.e...

An XML-Based Bootstrapping Method for Pattern Acquisition

Extensible Markup Language (XML) has been widely used as middleware because of its flexibility. A fixed domain is one of the bottlenecks of Information Extraction (IE) technologies. In this paper we present an XML-based, domain-adaptable bootstrapping method for pattern acquisition, which focuses on minimizing the cost of domain migration. The approach starts from a seed corpus with some seed patt...

Publication date: 2012